In this final lecture, we will look at a number of useful applications of the techniques discussed in previous weeks. These all come from the worlds of finance and economics, mostly because of the abundance of data. We will also be using Wes McKinney's template .ipynb file as a skeleton for the lectures.
Throughout, we will make use of some technical jargon. Here are frequent terms that you should be familiar with:
cross-section: data observed at a fixed point in time (for example, the closing prices of many tickers on a particular day).
time series: data for a single item observed over a range of points in time.
panel: data observed along both the item and the time dimension.
In [ ]:
from __future__ import division
from pandas import Series, DataFrame
import pandas as pd
from numpy.random import randn
import numpy as np
pd.options.display.max_rows = 12
np.set_printoptions(precision=4, suppress=True)
import matplotlib.pyplot as plt
In [ ]:
%matplotlib inline
In previous lectures we have discussed a number of useful tools within the Python data analysis ecosystem for data munging. Here we will see a few of these tools in action.
Suppose you are given two different financial datasets on which you hope to perform a reasonable analysis. More likely than not, the series will have indices that don't line up perfectly, or (if the data is stored in a DataFrame) columns or row labels that don't match.
This is such a frequent and frustrating problem in data analysis that it has its own name: the data alignment problem. In Pandas, basic arithmetic operations between datasets perform automatic alignment. For example, let us consider the following datasets.
The stock_px.csv dataset can be downloaded here and the volume.csv dataset can be downloaded here.
In [ ]:
close_px = pd.read_csv('stock_px.csv', parse_dates=True, index_col=0)
volume = pd.read_csv('volume.csv', parse_dates=True, index_col=0)
prices = close_px.ix['2011-09-05':'2011-09-14', ['AAPL', 'JNJ', 'SPX', 'XOM']]
volume = volume.ix['2011-09-05':'2011-09-12', ['AAPL', 'JNJ', 'XOM']]
In [ ]:
prices
In [ ]:
volume
One useful metric in the financial analysis of stock prices is the volume-weighted average price, or VWAP, of a stock: the sum of price times volume over the window of interest, divided by the total volume, i.e. VWAP = sum(price_t * volume_t) / sum(volume_t). This metric weights the price of a stock at any given time by the volume of trades occurring at that time. The idea is that high-volume trades give a better indication of how institutional investors, who presumably have an expert understanding of the stock's value, perceive it.
Since Pandas aligns the data automatically and excludes missing data in functions like sum, we can express this concisely as:
In [ ]:
prices * volume
In [ ]:
vwap = (prices * volume).sum() / volume.sum()
In [ ]:
vwap
Because no volume data is given for SPX (the S&P 500 index), its VWAP value is consequently missing. We can remove it from the above series with a simple call to dropna.
In [ ]:
vwap.dropna()
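As a quick sanity check (a small sketch reusing the objects defined above), you can reproduce a single symbol's VWAP by hand and compare it against the aligned computation; this should match vwap['AAPL'], since the element-wise product aligns on the shared dates before summing.
In [ ]:
# Hand-computed VWAP for one symbol, for comparison with vwap['AAPL']
(prices['AAPL'] * volume['AAPL']).sum() / volume['AAPL'].sum()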
Manual alignment can be achieved with the align method built into DataFrame objects.
In [ ]:
prices.align(volume, join='inner')
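Note that align returns a pair of reindexed objects rather than a single result; a minimal sketch of unpacking and inspecting them:
In [ ]:
# align returns a tuple of (left, right) objects reindexed to the common labels
left, right = prices.align(volume, join='inner')
left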
You can also build DataFrame objects from Series objects that have different individual shapes without a hitch, thanks to Pandas' automatic alignment.
In [ ]:
s1 = Series(range(3), index=['a', 'b', 'c'])
s2 = Series(range(4), index=['d', 'b', 'c', 'e'])
s3 = Series(range(3), index=['f', 'a', 'c'])
DataFrame({'one': s1, 'two': s2, 'three': s3})
Again, NaN is assigned for any label that the corresponding original Series does not cover.
You can specify explicitly the index of the result, discarding the rest:
In [ ]:
DataFrame({'one': s1, 'two': s2, 'three': s3}, index=list('face'))
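As a rough equivalent (a small sketch, not a new feature), you could build the full DataFrame first and then use reindex to pick out the same rows:
In [ ]:
# Equivalent to passing index=list('face') at construction time
DataFrame({'one': s1, 'two': s2, 'three': s3}).reindex(list('face'))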
The Federal Reserve publishes new GDP data every quarter, but only publishes inflation data annually. Publicly listed firms are required to provide income and balance sheet data quarterly, but the reporting dates need not be the same for every company. These types of problems fall under the category of time series frequency problems, and Pandas provides a number of techniques for solving them.
Suppose you have a time series that contains data compiled weekly (on Wednesdays):
In [ ]:
ts1 = Series(np.random.randn(3),
index=pd.date_range('2012-6-13', periods=3, freq='W-WED'))
ts1
In certain circumstances, it might be useful to resample the data to different frequencies. For example, if you need to resample the data so that an entry is provided for every business day in the period, you can employ:
In [ ]:
ts1.resample('B')
Notice here that, because no new information is provided, all of the new dates are simply left as NaN. If you want to fill these gaps with previous data, you can apply various fill strategies via the fill_method parameter.
In [ ]:
ts1.resample('B', fill_method='ffill')
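For comparison, here is a rough sketch of the same upsampling expressed with reindex over an explicit business-day range, which should give a similar result:
In [ ]:
# Build a business-day index covering the series and forward-fill onto it
bdays = pd.date_range(ts1.index[0], ts1.index[-1], freq='B')
ts1.reindex(bdays, method='ffill')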
This is an elegant solution for upsampling from lower-frequency to higher-frequency data, but another class of frequency problems involves irregular time series data, for which the above methods will not work as neatly. Consider the following data set containing irregularly sampled data:
In [ ]:
dates = pd.DatetimeIndex(['2012-6-12', '2012-6-17', '2012-6-18',
'2012-6-21', '2012-6-22', '2012-6-29'])
ts2 = Series(np.random.randn(6), index=dates)
ts2
If you want to add the "as of" values in ts1 to ts2, one option would be to resample both to a regular frequency and then add, but if you want to maintain the date index in ts2, using reindex is a more precise solution:
In [ ]:
ts1.reindex(ts2.index, method='ffill')
In [ ]:
ts2 + ts1.reindex(ts2.index, method='ffill')
Periods, as opposed to timestamps, are another common approach to organizing time series data, and they come with their own kind of time series frequency problems. Suppose you have the following GDP and inflation data, which are both periodic but at different frequencies:
In [ ]:
gdp = Series([1.78, 1.94, 2.08, 2.01, 2.15, 2.31, 2.46],
index=pd.period_range('1984Q2', periods=7, freq='Q-SEP'))
infl = Series([0.025, 0.045, 0.037, 0.04],
index=pd.period_range('1982', periods=4, freq='A-DEC'))
gdp
In [ ]:
infl
In Pandas, unlike with timestamps, operations between period-indexed time series of different frequencies are not possible without explicit conversions. In this case, if we know that the infl values were observed at the end of each year, we can convert them to Q-SEP to get the right periods in that frequency:
In [ ]:
infl_q = infl.asfreq('Q-SEP', how='end')
infl_q
Then, we can simply reindex the inflation time series data with a forward-filling method to match the GDP data.
In [ ]:
infl_q.reindex(gdp.index, method='ffill')
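Once both series share the same Q-SEP PeriodIndex, arithmetic between them is well-defined again. A purely illustrative sketch (the particular combination has no economic meaning; the point is only that the operation now aligns):
In [ ]:
# Align inflation to the GDP periods, then combine freely
infl_aligned = infl_q.reindex(gdp.index, method='ffill')
gdp + infl_aligned  # works because both share the same Q-SEP PeriodIndex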
Suppose you have a long time series containing intraday market data and you want to extract the prices at a particular time of day on each day of the data. What if the data are irregular such that observations do not fall exactly on the desired time? In practice this task can make for error-prone data munging if you are not careful. Here is an example for illustration purposes:
In [ ]:
# Make an intraday date range and time series
rng = pd.date_range('2012-06-01 09:30', '2012-06-01 15:59', freq='T')
# Extend to 4 consecutive business days of 9:30-15:59 values
rng = rng.append([rng + pd.offsets.BDay(i) for i in range(1, 4)])
ts = Series(np.arange(len(rng), dtype=float), index=rng)
ts
We can index ts with a datetime.time object to extract the values at 10:00 AM on each day in the series.
In [ ]:
from datetime import time
ts[time(10, 0)]
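Equivalently, assuming your pandas version provides the at_time method (it performs the same time-of-day selection):
In [ ]:
# Select all observations falling exactly at 10:00 AM
ts.at_time(time(10, 0))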
Alternatively, you can select all timestamps between two times of day, such as 10:00 AM and 10:01 AM (inclusive).
In [ ]:
ts.between_time(time(10, 0), time(10, 1))
In [ ]:
np.random.seed(12346)
However, it could be the case that no data actually fell exactly at 10:00 AM, but you might want to know the last known value at 10:00. In that case, you might do something like the following:
In [ ]:
# Set most of the time series randomly to NA
indexer = np.sort(np.random.permutation(len(ts))[700:])
irr_ts = ts.copy()
irr_ts[indexer] = np.nan
irr_ts['2012-06-01 09:50':'2012-06-01 10:00']
In [ ]:
selection = pd.date_range('2012-06-01 10:00', periods=4, freq='B')
irr_ts.asof(selection)
The asof method performs this for you: it returns the last valid (non-NA) value at or before each requested timestamp.
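A rough hand-rolled equivalent (a sketch that works here only because the requested 10:00 timestamps are present in the minute-level index) is to forward-fill and then select:
In [ ]:
# Forward-fill the gaps, then pull out the requested timestamps
irr_ts.fillna(method='ffill')[selection]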
In financial and economic contexts, there are a number of widely occurring use cases for merging related datasets: switching from one source of time series data to another at a specific point in time; "patching" missing values in one time series using values from another; and completely replacing the data for a subset of symbols (columns).
In the first case, switching from one set of time series to another at a specific instant, it is a matter of splicing together two TimeSeries or DataFrame objects using pandas.concat:
In [ ]:
data1 = DataFrame(np.ones((6, 3), dtype=float),
columns=['a', 'b', 'c'],
index=pd.date_range('6/12/2012', periods=6))
data2 = DataFrame(np.ones((6, 3), dtype=float) * 2,
columns=['a', 'b', 'c'],
index=pd.date_range('6/13/2012', periods=6))
spliced = pd.concat([data1.ix[:'2012-06-14'], data2.ix['2012-06-15':]])
spliced
Suppose, in a similar example, that data1 was missing a time series present in data2:
In [ ]:
data2 = DataFrame(np.ones((6, 4), dtype=float) * 2,
columns=['a', 'b', 'c', 'd'],
index=pd.date_range('6/13/2012', periods=6))
spliced = pd.concat([data1.ix[:'2012-06-14'], data2.ix['2012-06-15':]])
spliced
Using combine_first, you can bring in data from before the splice point to extend the history for the d column:
In [ ]:
spliced_filled = spliced.combine_first(data2)
spliced_filled
Since there isn't any data for d on 2012-06-12 in data2, that entry remains NaN.
The update method on DataFrame objects performs in-place updates to the data. To fill in only the holes in the data, pass overwrite=False.
In [ ]:
spliced.update(data2, overwrite=False)
In [ ]:
spliced
You can also replace the data for any subset of symbols (columns). Below, the a and c columns are overwritten with the corresponding columns from data1:
In [ ]:
cp_spliced = spliced.copy()
cp_spliced[['a', 'c']] = data1[['a', 'c']]
cp_spliced
The return of a financial asset is defined as the cumulative percent change in its price.
Here is some price data for Apple, Inc.:
In [ ]:
import pandas.io.data as web
price = web.get_data_yahoo('AAPL', '2011-01-01')['Adj Close']
price[-5:]
At the time this example was written, Apple paid no dividends, which would otherwise have to be included in the calculation of net returns. Thus, a quick-and-dirty computation of returns will suffice:
In [ ]:
price['2011-10-03'] / price['2011-3-01'] - 1
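Equivalently, the per-period simple return is price_t / price_{t-1} - 1, which is exactly what pct_change computes; a quick sketch:
In [ ]:
# Same quantity as price.pct_change()
simple_rets = price / price.shift(1) - 1
simple_rets[-5:]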
Stocks with dividend payments complicate the computation because you have to factor in the stream of payments over time. The Adj Close column attempts to adjust for stock splits and dividends, but in any case it's quite common to derive a return index, which indicates the value of a one-dollar investment in the asset. For Apple, let's compute a simple return index using cumprod.
In [ ]:
returns = price.pct_change()
ret_index = (1 + returns).cumprod()
ret_index[0] = 1 # Set first value to 1
ret_index.plot(grid=True)
With a return index, we can manipulate the frequency at which we compute the returns quite easily.
In [ ]:
m_returns = ret_index.resample('BM', how='last').pct_change()
m_returns['2012']
Since no dividends or other adjustments are considered, we could alternatively have computed the monthly returns from the daily percent changes by resampling with a simple aggregation:
In [ ]:
m_rets = (1 + returns).resample('M', how='prod', kind='period') - 1
m_rets['2012']
Then, if you had the dividend payment data in hand, say in hypothetical dividend_dates and dividend_pcts variables holding the payment dates and the payment sizes as fractions of the price, you could include the payments by simply adding them to the daily returns:
returns[dividend_dates] += dividend_pcts
Let's consider a collection of made-up assets. We first generate a universe of 1000 tickers:
In [ ]:
pd.options.display.max_rows = 100
pd.options.display.max_columns = 10
np.random.seed(12345)
In [ ]:
import random; random.seed(0)
import string
N = 1000
def rands(n):
    choices = string.ascii_uppercase
    return ''.join([random.choice(choices) for _ in xrange(n)])
tickers = np.array([rands(5) for _ in xrange(N)])
Then, we can create a DataFrame containing three columns representing random portfolios for a subset of the above tickers:
In [ ]:
M = 500
df = DataFrame({'Momentum' : np.random.randn(M) / 200 + 0.03,
'Value' : np.random.randn(M) / 200 + 0.08,
'ShortInterest' : np.random.randn(M) / 200 - 0.02},
index=tickers[:M])
df.head()
We can aggregate these random tickers by industry. In this simple example, let's use two industries, financial and technology, and store the mapping as a Series object.
In [ ]:
ind_names = np.array(['FINANCIAL', 'TECH'])
sampler = np.random.randint(0, len(ind_names), N)
industries = Series(ind_names[sampler], index=tickers,
name='industry')
Using groupby mechanics, we can group the portfolio data by industry and carry out group aggregations and transformations:
In [ ]:
by_industry = df.groupby(industries)
by_industry.mean()
Of course, remember the handy describe method:
In [ ]:
by_industry.describe()
We can transform these portfolios along a particular industry by defining customized transformation functions. For example, standardizing within industry is widely used in equity portfolio research:
In [ ]:
# Within-Industry Standardize
def zscore(group):
    return (group - group.mean()) / group.std()
df_stand = by_industry.apply(zscore)
You can verify that each industry has mean (very nearly) 0 and standard deviation 1:
In [ ]:
df_stand.groupby(industries).agg(['mean', 'std'])
Other built-in kinds of transforms, such as rank, can be used to make the analysis more concise.
In [ ]:
# Within-industry rank descending
ind_rank = by_industry.rank(ascending=False)
ind_rank.groupby(industries).agg(['min', 'max'])
In quantitative equity, "rank and standardize" is a common sequence of transformations. You can compose these operations concisely with a well-placed lambda, as follows:
In [ ]:
# Industry rank and standardize
by_industry.apply(lambda x: zscore(x.rank())).head()
Quantitative portfolio management takes heavy advantage of factor analysis. From Wikipedia:
Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, it is possible that variations in four observed variables mainly reflect the variations in two unobserved variables. Factor analysis searches for such joint variations in response to unobserved latent variables.
Portfolio holdings and performance are decomposed using one or more factors represented as a portfolio of weights. A common example is a stock's beta, which measures co-movement between a stock and a benchmark (like the S&P 500). We can consider a contrived example of a portfolio constructed from three randomly-generated factors (usually called factor loadings) and some weights:
In [ ]:
from numpy.random import rand
fac1, fac2, fac3 = np.random.rand(3, 1000)
ticker_subset = tickers.take(np.random.permutation(N)[:1000])
# Weighted sum of factors plus noise
port = Series(0.7 * fac1 - 1.2 * fac2 + 0.3 * fac3 + rand(1000),
index=ticker_subset)
factors = DataFrame({'f1': fac1, 'f2': fac2, 'f3': fac3},
index=ticker_subset)
Pairwise correlations between each factor and the portfolio may not reveal much on their own:
In [ ]:
factors.corrwith(port)
The standard method for computing factor exposures is least squares regression. You can do so with a number of Python libraries, from SciPy and NumPy to more advanced libraries such as statsmodels. However, Pandas makes the process particularly easy with its pandas.ols function.
In [ ]:
pd.ols(y=port, x=factors).beta
Compare these with the original factor weights specified above, and you will see that this regression performed considerably better than the corrwith results; we have almost completely recovered the weights. With groupby you can compute exposures industry by industry. To do so, wrap the regression in a new function, such as:
In [ ]:
def beta_exposure(chunk, factors=None):
    return pd.ols(y=chunk, x=factors).beta
Then, simply group port by industries and apply the function, passing the DataFrame of factor loadings:
In [ ]:
by_ind = port.groupby(industries)
exposures = by_ind.apply(beta_exposure, factors=factors)
exposures.unstack()
In many circumstances it is useful to break down data based on sample quantiles. For example, the performance of a stock portfolio could be separated into quartiles based on each stock's price-to-earnings ratio. With Pandas, the pandas.qcut function combined with groupby makes this a straightforward task.
Consider a simple trend following or momentum strategy trading the S&P 500 index via the SPY exchange-traded fund.
In [ ]:
import pandas.io.data as web
data = web.get_data_yahoo('SPY', '2006-01-01')
data.info()
We compute the daily returns and define a function that transforms the returns into a trend signal formed by a lagged moving sum:
In [ ]:
px = data['Adj Close']
returns = px.pct_change()
def to_index(rets):
    index = (1 + rets).cumprod()
    # Set the value just before the first valid observation to 1
    first_loc = max(index.index.get_loc(index.first_valid_index()) - 1, 0)
    index.values[first_loc] = 1
    return index
def trend_signal(rets, lookback, lag):
    signal = pd.rolling_sum(rets, lookback, min_periods=lookback - 5)
    return signal.shift(lag)
Using this function, we can create and test a trading strategy that trades this momentum signal every Friday:
In [ ]:
signal = trend_signal(returns, 100, 3)
trade_friday = signal.resample('W-FRI').resample('B', fill_method='ffill')
trade_rets = trade_friday.shift(1) * returns
trade_rets = trade_rets[:len(returns)]
We can then convert the strategy returns to a return index and plot them:
In [ ]:
to_index(trade_rets).plot(grid=True, figsize=(12,6))
Caveat: this is a naive strategy!
Suppose that you want to decompose the strategy performance into more and less volatile periods of trading. Trailing one-year annualized standard deviation is a simple measure of volatility, and we can compute Sharpe ratios to assess the reward-to-risk ratio in various volatility regimes:
In [ ]:
vol = pd.rolling_std(returns, 250, min_periods=200) * np.sqrt(250)
def sharpe(rets, ann=250):
    return rets.mean() / rets.std() * np.sqrt(ann)
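Before splitting performance into regimes, it can be worth a quick look at the strategy-wide annualized Sharpe ratio (a small sketch using the function just defined):
In [ ]:
# Whole-sample annualized Sharpe ratio of the strategy
sharpe(trade_rets)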
Now we can divide vol into quartiles with pd.qcut and aggregate with sharpe:
In [ ]:
cats = pd.qcut(vol, 4)
print('cats: %d, trade_rets: %d, vol: %d' % (len(cats), len(trade_rets), len(vol)))
In [ ]:
trade_rets.groupby(cats).agg(sharpe)
These results show that the strategy performed best when volatility was highest.
From Python for Data Analysis:
In practice, modeling and trading futures contracts on equities, currencies, commodities, bonds, and other asset classes is complicated by the time-limited nature of each contract. For example, at any given time for a type of future (say silver or copper futures) multiple contracts with different expiration dates may be traded. In many cases, the future contract expiring next (the near contract) will be the most liquid (highest volume and lowest bid-ask spread).
For the purposes of modeling and forecasting, it can be much easier to work with a continuous return index indicating the profit and loss associated with always holding the near contract. Transitioning from an expiring contract to the next (or far) contract is referred to as rolling. Computing a continuous future series from the individual contract data is not necessarily a straightforward exercise and typically requires a deeper understanding of the market and how the instruments are traded. For example, in practice when and how quickly would you trade out of an expiring contract and into the next contract?
We will go through one way to do this. Using scaled prices for the SPY exchange-traded fund as a proxy for the S&P 500, we have:
In [ ]:
pd.options.display.max_rows = 10
import pandas.io.data as web
# Approximate price of S&P 500 index
px = web.get_data_yahoo('SPY')['Adj Close'] * 10
px
Now, a little bit of preamble. Let's put a couple of S&P 500 futures contracts and their expiry dates in a Series object.
In [ ]:
from datetime import datetime
expiry = {'ESU2': datetime(2012, 9, 21),
'ESZ2': datetime(2012, 12, 21)}
expiry = Series(expiry).order()
expiry
Using Yahoo! Finance prices and a random walk with noise, we can simulate the two contracts into the future.
In [ ]:
np.random.seed(12347)
N = 200
walk = (np.random.randint(0, 200, size=N) - 100) * 0.25
perturb = (np.random.randint(0, 20, size=N) - 10) * 0.25
walk = walk.cumsum()
rng = pd.date_range(px.index[0], periods=len(px) + N, freq='B')
near = np.concatenate([px.values, px.values[-1] + walk])
far = np.concatenate([px.values, px.values[-1] + walk + perturb])
prices = DataFrame({'ESU2': near, 'ESZ2': far}, index=rng)
prices then contains one time series for each contract, the two differing from one another by a random amount:
In [ ]:
prices.tail()
The technique we will use to splice these two separate time series into a single continuous series is to construct a weighting matrix. Active contracts get a weight of 1 until the expiry date approaches, at which point we decide on a roll convention. Here is a function that computes a weighting matrix with linear decay over a number of periods prior to expiry:
In [ ]:
def get_roll_weights(start, expiry, items, roll_periods=5):
    # start : first date to compute weighting DataFrame
    # expiry : Series of ticker -> expiration dates
    # items : sequence of contract names
    dates = pd.date_range(start, expiry[-1], freq='B')
    weights = DataFrame(np.zeros((len(dates), len(items))),
                        index=dates, columns=items)
    prev_date = weights.index[0]
    for i, (item, ex_date) in enumerate(expiry.iteritems()):
        if i < len(expiry) - 1:
            weights.ix[prev_date:ex_date - pd.offsets.BDay(), item] = 1
            roll_rng = pd.date_range(end=ex_date - pd.offsets.BDay(),
                                     periods=roll_periods + 1, freq='B')
            decay_weights = np.linspace(0, 1, roll_periods + 1)
            weights.ix[roll_rng, item] = 1 - decay_weights
            weights.ix[roll_rng, expiry.index[i + 1]] = decay_weights
        else:
            weights.ix[prev_date:, item] = 1
        prev_date = ex_date
    return weights
The weights look like this around the ESU2 expiry:
In [ ]:
weights = get_roll_weights('6/1/2012', expiry, prices.columns)
weights.ix['2012-09-12':'2012-09-21']
Finally, the rolled future returns are just a weighted sum of the contract returns:
In [ ]:
rolled_returns = (prices.pct_change() * weights).sum(1)
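To see the effect of the roll (a quick inspection, reusing the expiry window shown above), you can slice the rolled return series around the ESU2 expiry:
In [ ]:
# Blended returns during the roll into ESZ2
rolled_returns['2012-09-12':'2012-09-21']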
From Python for Data Analysis:
Dynamic models play an important role in financial modeling as they can be used to simulate trading decisions over a historical period. Moving window and exponentially-weighted time series functions are an example of tools that are used for dynamic models.
Correlation is one way to look at the co-movement between the changes in two asset time series. Pandas' rolling_corr function can be called with two return series to compute the moving-window correlation.
First, let's load some stocks from Yahoo! Finance and compute the daily returns.
In [ ]:
aapl = web.get_data_yahoo('AAPL', '2000-01-01')['Adj Close']
msft = web.get_data_yahoo('MSFT', '2000-01-01')['Adj Close']
aapl_rets = aapl.pct_change()
msft_rets = msft.pct_change()
Then, compute and plot the one-year moving correlation.
In [ ]:
plt.figure()
pd.rolling_corr(aapl_rets, msft_rets, 250).plot(grid=True, figsize=(12,6))
One shortcoming of correlation is that it does not capture differences in volatility between the two assets. Least-squares regression addresses this: OLS can model the dynamic relationship between a variable and one or more predictor variables.
In [ ]:
plt.figure()
model = pd.ols(y=aapl_rets, x={'MSFT': msft_rets}, window=250)
model.beta
In [ ]:
model.beta['MSFT'].plot(grid=True, figsize=(12,6))
There are of course more sophisticated techniques than OLS regression, which can be found in the statsmodels project, but this gives a flavor of the kind of analysis that is possible.